A recent study published by the Harvard Injury Control Research Center shows that the frequency of mass shootings has increased over time. Some of the deadliest mass shootings in the U.S. happened in 2018, including the Parkland high school shooting in February and the massacre at the Tree of Life synagogue in Pittsburgh on October 27, both of which prompted widespread national horror. The death toll is not irrelevant to shooters' motives: criminologists who have studied them believe that body counts play a role in their calculations. Mass shootings also say a lot about a society, a country, and how well we are doing at preventing loss of life.
Load the data from source.
mass_shootings <- read.csv("~/STA 518/Preparations/Using-a-Machine-Learning-to-Predict-US-Mass-Shooting/Mass Shootings Dataset Ver 5.csv")
Let's import all the libraries needed for data cleaning and data visualization.
library(data.table) # Faster creation and handling of data frames
library(readr) # read_csv function file I/O
library(dplyr) # Data Manipulation
library(janitor) # perfectly format data.frame column
library(tidyr) # tidy manipulation
library(tidyverse) #for data cleaning
library(stringr) # to manipulate character type data variables
library(reshape2) # easy to transform data between wide and long formats.
library(lubridate) # For easy handling dates
library(tm) # Text mining
library(wordcloud) # Form Wordcloud
library(SnowballC) # goes well with wordcloud
library(ggthemes) # Prettier ggplot2 themes and color-blind-accessible palettes
library(plotly) # for interactive visualizations
library(ggplot2) # Data visualization
library(knitr) # For pretty tables
library(scales) # To add more ticks and facilitate plot interpretation
library(lattice) # powerful and elegant data visualization
library(chron) # Provides chronological objects which can handle dates and times.
library(cowplot) # Subplots
library(maptools) # Tools for Handling Spatial Objects
library(leaflet) # For maps
library(maps) # for map visualization
library(kableExtra)
## Error: package or namespace load failed for 'kableExtra':
## .onLoad failed in loadNamespace() for 'kableExtra', details:
## call: !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output %in%
## error: 'length = 2' in coercion to 'logical(1)'
library(DT) # interactive data table
library(gtable) # construct and manipulate layouts containing graphical elements.
library(grid) # rewrite of the graphics layout capabilities, plus some support for interaction.
library(formattable) #Table
library(ggrepel) # ggrepel provides geoms for ggplot2 to repel overlapping text labels
library(forcats) # Categorical Data Wrangling
library(exploreR)
The first step is to clean the variable names so that they are easy to work with.
mass_shootings <- clean_names(mass_shootings)
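`janitor::clean_names()` converts column names to snake_case, so a name like `Total victims` becomes `total_victims`. A tiny sketch with a made-up data frame, for illustration only:

```r
library(janitor)

# Hypothetical messy column names, purely for illustration
df <- data.frame("Total victims" = 1:3, "Mental Health Issues" = 4:6,
                 check.names = FALSE)
names(clean_names(df))
# -> "total_victims" "mental_health_issues"
```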
# Data Modelling

The dataset contains one record per incident, with the following variables: Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issues, Race, Gender, and latitude/longitude. All the variables are fairly self-explanatory.
(head(mass_shootings,5))
## s title location date
## 1 1 Texas church mass shooting Sutherland Springs, TX 11/5/2017
## 2 2 Walmart shooting in suburban Denver Thornton, CO 11/1/2017
## 3 3 Edgewood businees park shooting Edgewood, MD 10/18/2017
## 4 4 Las Vegas Strip mass shooting Las Vegas, NV 10/1/2017
## 5 5 San Francisco UPS shooting San Francisco, CA 6/14/2017
## incident_area open_close_location target
## 1 Church Close random
## 2 Wal-Mart Open random
## 3 Remodeling Store Close coworkers
## 4 Las Vegas Strip Concert outside Mandala Bay Open random
## 5 UPS facility Close coworkers
## cause
## 1 unknown
## 2 unknown
## 3 unknown
## 4 unknown
## 5
## summary
## 1 Devin Patrick Kelley, 26, an ex-air force officer, shot and killed 26 people and wounded 20 at a church in Texas. He was found dead later in his vehicle.
## 2 Scott Allen Ostrem, 47, walked into a Walmart in a suburb north of Denver and fatally shot two men and a woman, then left the store and drove away. After an all-night manhunt, Ostrem, who had financial problems but no serious criminal history, was captured by police after being spotted near his apartment in Denver.
## 3 Radee Labeeb Prince, 37, fatally shot three people and wounded two others around 9am at Advance Granite Solutions, a home remodeling business where he worked near Baltimore. Hours later he shot and wounded a sixth person at a car dealership in Wilmington, Delaware. He was apprehended that evening following a manhunt by authorities.
## 4 Stephen Craig Paddock, opened fire from the 32nd floor of Manadalay Bay hotel at Last Vegas concert goers for no obvious reason. He shot himself and died on arrival of law enforcement agents. He was 64
## 5 Jimmy Lam, 38, fatally shot three coworkers and wounded two others inside a UPS facility in San Francisco. Lam killed himself as law enforcement officers responded to the scene.
## fatalities injured total_victims policeman_killed age employeed_y_n
## 1 26 20 46 0 26 NA
## 2 3 0 3 0 47 NA
## 3 3 3 6 0 37 NA
## 4 59 527 585 1 64 NA
## 5 3 2 5 0 38 1
## employed_at mental_health_issues race gender latitude longitude
## 1 No White M NA NA
## 2 No White M NA NA
## 3 Advance Granite Store No Black M NA NA
## 4 Unclear White M 36.18127 -115.1341
## 5 Yes Asian M NA NA
Let's extract some information about our dataset and look at the missing data.
str(mass_shootings) # Display the structure of the dataset
## 'data.frame': 323 obs. of 21 variables:
## $ s : int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Texas church mass shooting" "Walmart shooting in suburban Denver" "Edgewood businees park shooting" "Las Vegas Strip mass shooting" ...
## $ location : chr "Sutherland Springs, TX" "Thornton, CO" "Edgewood, MD" "Las Vegas, NV" ...
## $ date : chr "11/5/2017" "11/1/2017" "10/18/2017" "10/1/2017" ...
## $ incident_area : chr "Church" "Wal-Mart" "Remodeling Store" "Las Vegas Strip Concert outside Mandala Bay" ...
## $ open_close_location : chr "Close" "Open" "Close" "Open" ...
## $ target : chr "random" "random" "coworkers" "random" ...
## $ cause : chr "unknown" "unknown" "unknown" "unknown" ...
## $ summary : chr "Devin Patrick Kelley, 26, an ex-air force officer, shot and killed 26 people and wounded 20 at a church in Texas. He was found "Scott Allen Ostrem, 47, walked into a Walmart in a suburb north of Denver and fatally shot two men and a woman, then left the s "Radee Labeeb Prince, 37, fatally shot three people and wounded two others around 9am at Advance Granite Solutions, a home remod "Stephen Craig Paddock, opened fire from the 32nd floor of Manadalay Bay hotel at Last Vegas concert goers for no obvious reason ...
## $ fatalities : int 26 3 3 59 3 3 5 3 3 5 ...
## $ injured : int 20 0 3 527 2 0 0 0 0 6 ...
## $ total_victims : int 46 3 6 585 5 3 5 3 3 11 ...
## $ policeman_killed : int 0 0 0 1 0 NA NA 1 NA NA ...
## $ age : chr "26" "47" "37" "64" ...
## $ employeed_y_n : int NA NA NA NA 1 1 1 1 NA NA ...
## $ employed_at : chr "" "" "Advance Granite Store" "" ...
## $ mental_health_issues: chr "No" "No" "No" "Unclear" ...
## $ race : chr "White" "White" "Black" "White" ...
## $ gender : chr "M" "M" "M" "M" ...
## $ latitude : num NA NA NA 36.2 NA ...
## $ longitude : num NA NA NA -115 NA ...
summary(mass_shootings) # Provide summary statistics for each variable in the dataset
## s title location date
## Min. : 1.0 Length:323 Length:323 Length:323
## 1st Qu.: 81.5 Class :character Class :character Class :character
## Median :162.0 Mode :character Mode :character Mode :character
## Mean :162.0
## 3rd Qu.:242.5
## Max. :323.0
##
## incident_area open_close_location target cause
## Length:323 Length:323 Length:323 Length:323
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## summary fatalities injured total_victims
## Length:323 Min. : 0.000 Min. : 0.000 Min. : 3.00
## Class :character 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 4.00
## Mode :character Median : 3.000 Median : 3.000 Median : 5.00
## Mean : 4.437 Mean : 6.176 Mean : 10.26
## 3rd Qu.: 5.500 3rd Qu.: 5.000 3rd Qu.: 9.00
## Max. :59.000 Max. :527.000 Max. :585.00
##
## policeman_killed age employeed_y_n employed_at
## Min. :0.0000 Length:323 Min. :0.0000 Length:323
## 1st Qu.:0.0000 Class :character 1st Qu.:0.0000 Class :character
## Median :0.0000 Mode :character Median :1.0000 Mode :character
## Mean :0.1293 Mean :0.6269
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :5.0000 Max. :1.0000
## NA's :6 NA's :256
## mental_health_issues race gender latitude
## Length:323 Length:323 Length:323 Min. :21.33
## Class :character Class :character Class :character 1st Qu.:33.57
## Mode :character Mode :character Mode :character Median :36.44
## Mean :37.23
## 3rd Qu.:41.48
## Max. :60.79
## NA's :20
## longitude
## Min. :-161.79
## 1st Qu.:-110.21
## Median : -88.12
## Mean : -94.43
## 3rd Qu.: -81.70
## Max. : -69.71
## NA's :20
names(mass_shootings) # Display the names of the variables in the dataset
## [1] "s" "title" "location"
## [4] "date" "incident_area" "open_close_location"
## [7] "target" "cause" "summary"
## [10] "fatalities" "injured" "total_victims"
## [13] "policeman_killed" "age" "employeed_y_n"
## [16] "employed_at" "mental_health_issues" "race"
## [19] "gender" "latitude" "longitude"
sapply(mass_shootings, function(x) sum(is.na(x)))
## s title location
## 0 0 0
## date incident_area open_close_location
## 0 0 0
## target cause summary
## 0 0 0
## fatalities injured total_victims
## 0 0 0
## policeman_killed age employeed_y_n
## 6 0 256
## employed_at mental_health_issues race
## 0 0 0
## gender latitude longitude
## 0 20 20
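The per-column missing-value audit above can also be written as a vectorised one-liner; an equivalent sketch:

```r
# Count NAs per column; equivalent to the sapply() call above
colSums(is.na(mass_shootings))
```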
duplicate_data <- mass_shootings[duplicated(mass_shootings$s), ] # the serial column is named "s" after clean_names()
# Display duplicate rows
print(duplicate_data)
## [1] s title location
## [4] date incident_area open_close_location
## [7] target cause summary
## [10] fatalities injured total_victims
## [13] policeman_killed age employeed_y_n
## [16] employed_at mental_health_issues race
## [19] gender latitude longitude
## <0 rows> (or 0-length row.names)
# Number of duplicate rows
print(nrow(duplicate_data))
## [1] 0
We don't have any duplicate rows; every incident in our dataset is unique. Let's also check whether the records are in chronological order.
Let's take a quick look at our dataset. Peek-a-boo!
# Lets take a peek of our datasets
glimpse(mass_shootings)
## Rows: 323
## Columns: 21
## $ s <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ title <chr> "Texas church mass shooting", "Walmart shooting i…
## $ location <chr> "Sutherland Springs, TX", "Thornton, CO", "Edgewo…
## $ date <chr> "11/5/2017", "11/1/2017", "10/18/2017", "10/1/201…
## $ incident_area <chr> "Church", "Wal-Mart", "Remodeling Store", "Las Ve…
## $ open_close_location <chr> "Close", "Open", "Close", "Open", "Close", "Close…
## $ target <chr> "random", "random", "coworkers", "random", "cowor…
## $ cause <chr> "unknown", "unknown", "unknown", "unknown", "", "…
## $ summary <chr> "Devin Patrick Kelley, 26, an ex-air force office…
## $ fatalities <int> 26, 3, 3, 59, 3, 3, 5, 3, 3, 5, 5, 3, 5, 49, 0, 1…
## $ injured <int> 20, 0, 3, 527, 2, 0, 0, 0, 0, 6, 0, 3, 11, 53, 4,…
## $ total_victims <int> 46, 3, 6, 585, 5, 3, 5, 3, 3, 11, 5, 6, 16, 102, …
## $ policeman_killed <int> 0, 0, 0, 1, 0, NA, NA, 1, NA, NA, NA, 3, 5, 0, 0,…
## $ age <chr> "26", "47", "37", "64", "38", "24", "45", "43", "…
## $ employeed_y_n <int> NA, NA, NA, NA, 1, 1, 1, 1, NA, NA, NA, NA, NA, N…
## $ employed_at <chr> "", "", "Advance Granite Store", "", "", "Weis gr…
## $ mental_health_issues <chr> "No", "No", "No", "Unclear", "Yes", "Unclear", "U…
## $ race <chr> "White", "White", "Black", "White", "Asian", "Whi…
## $ gender <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
## $ latitude <dbl> NA, NA, NA, 36.18127, NA, NA, NA, NA, NA, NA, NA,…
## $ longitude <dbl> NA, NA, NA, -115.13413, NA, NA, NA, NA, NA, NA, N…
Let's first clean up our data before we visualize it. Several variables can be imputed, missing values need handling, and we will create some new variables.
# Factor conversion
mass_shootings$date <- lubridate::mdy(mass_shootings$date) # convert date from character to Date
mass_shootings$summary <- as.factor(mass_shootings$summary) # convert summary to a factor variable
Let's plot the incidents on a map to get a sense of their geographic spread.
world.map <- map_data("state")
ggplot() +
geom_map(data = world.map, map = world.map,
aes(x = long, y = lat, group = group, map_id = region),
fill = "white", colour = "black") +
geom_point(data = mass_shootings,
aes(x = longitude, y = latitude, size = fatalities),
color = "firebrick", alpha = 0.6) +
xlim(-130, -65) +
ylim(25, 50) +
labs(title = "All Mass Shootings that occurred between 1966 and 2017",
subtitle = "Size represents the fatalities") +
theme_fivethirtyeight()
## Warning in geom_map(data = world.map, map = world.map, aes(x = long, y = lat, :
## Ignoring unknown aesthetics: x and y
## Warning: Removed 22 rows containing missing values (`geom_point()`).
## Date

Let's start with the date column. We will separate it into year, month, day, weekday, and week-of-year variables. According to the research, the average gap between mass shootings shrank from about 200 days during 1983–2011 to 64 days since 2011.
mass_shootings <- mass_shootings %>%
dplyr::mutate(year = year(date),
month = month(date),
month_name = month.abb[month],
day = day(date),
Weekday = weekdays(date),
dow = lubridate::wday(date),
Week = week(date))
Let's look at some of the most severe mass shootings in U.S. history in our dataset, which runs up to the Las Vegas shooting.
inci_dents <- mass_shootings %>%
select(title,location,total_victims) %>%
arrange(desc(total_victims)) %>%
head(10) # Lets pick only top ten otherwise it would be overcrowded.
inci_dents <- inner_join(inci_dents,mass_shootings,by=c("location","title"))
#Select column of interest
inci_dents <- inci_dents %>%
select(title,location,total_victims.x,month,day,year,fatalities,injured)
formattable(inci_dents,align="l",list(total_victims.x=color_tile("lightsalmon","red"),fatalities=color_bar("cyan"),injured=color_bar("darkorchid")))
| title | location | total_victims.x | month | day | year | fatalities | injured |
|---|---|---|---|---|---|---|---|
| Las Vegas Strip mass shooting | Las Vegas, NV | 585 | 10 | 1 | 2017 | 59 | 527 |
| Orlando nightclub massacre | Orlando, Florida | 102 | 6 | 12 | 2016 | 49 | 53 |
| Aurora theater shooting | Aurora, Colorado | 82 | 7 | 20 | 2012 | 12 | 70 |
| Virginia Tech massacre | Blacksburg, Virginia | 55 | 4 | 16 | 2007 | 32 | 23 |
| University of Texas at Austin | Austin, Texas | 48 | 8 | 1 | 1966 | 17 | 32 |
| Texas church mass shooting | Sutherland Springs, TX | 46 | 11 | 5 | 2017 | 26 | 20 |
| Fort Hood Army Base | Fort Hood, Texas | 45 | 11 | 5 | 2009 | 13 | 32 |
| Luby’s Cafeteria in Killeen, Texas | Killeen, Texas | 43 | 10 | 16 | 1991 | 24 | 20 |
| McDonald’s restaurant in San Ysidro | San Ysidro, California | 40 | 7 | 18 | 1984 | 22 | 19 |
| Columbine High School | Littleton, Colorado | 37 | 4 | 20 | 1999 | 15 | 24 |
# Word Cloud
Corpus <- Corpus(VectorSource(mass_shootings$title))
Corpus <- Corpus(VectorSource(Corpus))
Corpus <- tm_map(Corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(Corpus, removePunctuation): transformation drops
## documents
Corpus <- tm_map(Corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(Corpus, stripWhitespace): transformation drops
## documents
Corpus <- tm_map(Corpus, removeWords, stopwords('english'))
## Warning in tm_map.SimpleCorpus(Corpus, removeWords, stopwords("english")):
## transformation drops documents
Corpus <- tm_map(Corpus, removeWords, 'shooting')
## Warning in tm_map.SimpleCorpus(Corpus, removeWords, "shooting"): transformation
## drops documents
wordcloud(Corpus, scale=c(5,0.5), max.words = 200, use.r.layout=FALSE,
rot.per = 0.3, random.order = FALSE, colors=brewer.pal(8, 'Dark2'))
Like mass shootings in general, school shootings have gone from being a rare tragedy to a tragic reality; already in 2018 there have been more than a dozen instances of gun violence in U.S. schools. We may be unsure where to even begin with such a heavy topic; consider asking our kids what their questions are before giving our two cents. And yet many individuals who suffered childhood and adulthood traumas and severe abuse did not become rampage shooters; they developed healthy relationships with others. Are school shooters unable to develop healthy relationships with others?

## Total Fatalities
Let's look at the total number of fatalities over the years, using a line graph and a bar plot.
#1. Plot of # number of shootings by year and total victims
fatalities_yr<- mass_shootings %>%
select(year, fatalities, injured, total_victims) %>%
melt(id.vars=c("year"), measure.vars = c("fatalities","injured","total_victims"))
p <- ggplot(data = fatalities_yr %>%
group_by(year, variable) %>%
summarize(shoot_ings = length(year),fatalities = sum(value)),
aes(x=year, y=shoot_ings)) +
geom_line(size=.5,color="black") +
geom_point(shape=18,size=1, fill="black",stroke=1.2, color = "black") +
theme_fivethirtyeight() +
scale_fill_gdocs() +
ggtitle("Total Mass Shooting \n between 1966-2017")
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
q <- ggplot() +
geom_bar(data = fatalities_yr %>%
filter(variable == "total_victims") %>%
group_by(year, variable) %>%
summarize(shoot_ings = length(year), fatalities = sum(value)),
aes(x=year, y=fatalities, fill = variable), stat="identity") +
theme_fivethirtyeight() +
guides(fill = FALSE)+ # Remove legends
ggtitle("Total victims 1966-2017")
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_grid(p, q, labels = c('A', 'B')) # We can see side by side.
It is clear from the bar graph above that shootings have increased over the years. How does gender play a role?
We will collapse gender into four levels, Male, Female, Male/Female, or Unknown, to keep the analysis simple.
# Gender
# Before Clean up
#table(mass_shootings$gender) # Uncomment to see how it looks before.
# Lest recode Male,Female, Male/Female and Unknown.
mass_shootings$gender[mass_shootings$gender=="M"] <- "Male"
mass_shootings$gender[mass_shootings$gender=="F"] <- "Female"
mass_shootings$gender[mass_shootings$gender=="M/F"] <- "Male/Female"
mass_shootings$gender[mass_shootings$gender==""] <- "Unknown"
# After Cleanup
table(mass_shootings$gender)
##
## Female Male Male/Female Unknown
## 5 292 5 21
# nlevels(mass_shootings$gender) # should give us 4 if you want to double check
# sum(table(mass_shootings$gender))==nrow(mass_shootings) # Sanity check
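An equivalent, more declarative way to do the same recoding is `dplyr::recode()`. A sketch that assumes the raw codes are exactly `"M"`, `"F"`, `"M/F"`, and `""`; note it must be run on the raw data only, since `.default` would clobber already-recoded values:

```r
library(dplyr)

# One-step recode of the raw gender codes; anything unmatched (including the
# empty string) falls through to "Unknown"
mass_shootings <- mass_shootings %>%
  mutate(gender = recode(gender,
                         "M"   = "Male",
                         "F"   = "Female",
                         "M/F" = "Male/Female",
                         .default = "Unknown"))
```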
Next, let's clean up the remaining categorical variables, aggregating similar values into the same bucket.
How many shooters are known to have had mental health issues?
table(mass_shootings$mental_health_issues)
##
## No Unclear unknown Unknown Yes
## 93 13 1 110 106
mass_shootings$mental_health_issues[mass_shootings$mental_health_issues %in%c( "Unclear","unknown")] <-"Unknown"
What are the most common target locations chosen by shooters?
# Target
# sort(xtabs(formula = ~target, data = mass_shootings), decreasing = TRUE) # Uncomment to see
#Preprocessing Target
mass_shootings$target[mass_shootings$target == "random"] <- "Random"
mass_shootings$target[mass_shootings$target == "NA"] <- "Unknown"
mass_shootings$target[mass_shootings$target %in%c("Family","neighbors","Children","Family/Neighbors","Family+random","Friends","Family+students","partner's family","women")]<-"Friends & Family"
mass_shootings$target[mass_shootings$target %in% c("Coworkers","Ex-Coworkers","coworkers","Coworker's Family")]<-"Coworkers"
mass_shootings$target[mass_shootings$target %in% c("Students","Students+Teachers","Teachers","school girls","Students+Parents","Family+students")]<-"School"
mass_shootings$target[mass_shootings$target %in% c("Ex-Wife","Ex-Wife & Family","Ex-Girlfriend","Ex-girlfriend","Ex-GirlFriend","Girlfriend","Ex-Girlfriend+random","Ex-Girlfriend & Family")]<-"Ex- GF/Wife"
mass_shootings$target[mass_shootings$target %in% c("Policeman","police","Marines","Policeman+Council Member", "TSA Officer","Trooper","Social Workers")]<-"Police/Trooper"
mass_shootings$target[mass_shootings$target %in% c("party guests","uninvited guests","club members","birthday party bus")]<-"Parties"
mass_shootings$target[mass_shootings$target%in% c("Sikhs","monks","prayer group")]<-"Religious Group"
mass_shootings$target[mass_shootings$target%in% c("welding shop employees","hunters","lawyers","black men","Congresswoman","Contestant","drug dealer", "House Owner", "protestors","postmaster","rapper+random","psychologist+psychiatrist","basketball players")]<-"Other"
sort(table(mass_shootings$target), decreasing = TRUE)
##
## Random Friends & Family School Coworkers
## 140 50 38 32
## Ex- GF/Wife Police/Trooper Other Parties
## 17 14 13 11
## Religious Group
## 5 3
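The long chain of base-R replacements above can also be expressed with `forcats::fct_collapse()`, one named bucket per line. A sketch covering two of the buckets, on the raw target values; the rest follow the same pattern:

```r
library(forcats)
library(magrittr) # for %>%

# Collapse raw target values into broader buckets (sketch: two buckets shown)
mass_shootings$target <- mass_shootings$target %>%
  as.factor() %>%
  fct_collapse(
    Coworkers         = c("Coworkers", "Ex-Coworkers", "coworkers", "Coworker's Family"),
    `Religious Group` = c("Sikhs", "monks", "prayer group")
  ) %>%
  as.character()
```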
Do shooters prefer open or closed locations?
#Open_close_location
# sort(xtabs(formula = ~open_close_location, data = df_1), decreasing = TRUE)
mass_shootings$open_close_location[mass_shootings$open_close_location == "Open+CLose"] <- "Open+Close"
sort(xtabs(formula = ~open_close_location, data = mass_shootings), decreasing = TRUE)
## open_close_location
## Close Open Open+Close
## 197 78 28 20
# Location
# Lets keep the location column just by passing remove = FALSE
mass_shootings<- mass_shootings %>% separate(location , into = c("city", "state"), sep = ",", remove = FALSE) # split the location
## Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [147, 176,
## 225, 241].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 46 rows [16, 17, 18, 19,
## 20, 21, 22, 23, 24, 25, 26, 30, 35, 36, 37, 38, 39, 41, 43, 44, ...].
mass_shootings <- mass_shootings %>% mutate(city = tolower(city)) # lower-case all city names
We get an additional warning for 4 rows [147, 176, 225, 241]. Let's look at those rows to see why.
# Lets look into those rows why it is generating error
mass_shootings[c(147, 176, 225, 241), c("location", "city", "state")]
## location city
## 147 Pennsburg, Souderton, Lansdale, Harleysville, Pennsylvania pennsburg
## 176 South Valley, Albuquerque, New Mexico south valley
## 225 Nickel Mines, Lancaster, Pennsylvania nickel mines
## 241 Santee, San Diego, California santee
## state
## 147 Souderton
## 176 Albuquerque
## 225 Lancaster
## 241 San Diego
In these 4 rows the location contains more than one comma, so `separate()` split on the first comma and put the wrong piece into state. We can fix this by splitting on the last comma instead.
# Now we are splitting based on last comma to city and state.
mass_shootings<- separate(mass_shootings, location, c("city", "state"), sep = ",(?=[^,]+$)", remove=FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 46 rows [16, 17, 18, 19,
## 20, 21, 22, 23, 24, 25, 26, 30, 35, 36, 37, 38, 39, 41, 43, 44, ...].
mass_shootings[c(147, 176, 225, 241), c("location", "city", "state")]
## location
## 147 Pennsburg, Souderton, Lansdale, Harleysville, Pennsylvania
## 176 South Valley, Albuquerque, New Mexico
## 225 Nickel Mines, Lancaster, Pennsylvania
## 241 Santee, San Diego, California
## city state
## 147 Pennsburg, Souderton, Lansdale, Harleysville Pennsylvania
## 176 South Valley, Albuquerque New Mexico
## 225 Nickel Mines, Lancaster Pennsylvania
## 241 Santee, San Diego California
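The pattern `,(?=[^,]+$)` uses a lookahead: it matches only the comma that has no further comma after it, i.e. the last one. A quick base-R illustration (`perl = TRUE` enables lookahead support):

```r
x <- "Nickel Mines, Lancaster, Pennsylvania"
# Split on the last comma only; the lookahead (?=[^,]+$) anchors the match
strsplit(x, ",(?=[^,]+$)", perl = TRUE)[[1]]
# -> "Nickel Mines, Lancaster" and " Pennsylvania"
```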
Let's count the incidents per state:
sort(table(mass_shootings$state),decreasing = TRUE)
##
## California Florida Texas Washington Georgia
## 29 20 16 14 13
## Arizona North Carolina New York Ohio Alabama
## 11 11 10 10 9
## Illinois Wisconsin Pennsylvania Colorado Michigan
## 9 9 8 6 6
## Kentucky Nevada Oklahoma South Carolina Tennessee
## 5 5 5 5 5
## Virginia Kansas Louisiana Massachusetts Minnesota
## 5 4 4 4 4
## Mississippi Oregon Connecticut Missouri Nebraska
## 4 4 3 3 3
## New Jersey New Mexico Arkansas CA Montana
## 3 3 2 2 2
## Utah Virginia Alaska CO Hawaii
## 2 1 1 1 1
## Idaho Indiana Iowa LA Maine
## 1 1 1 1 1
## MD NV PA South Dakota Texas
## 1 1 1 1 1
## TX Vermont WA West Virginia Wyoming
## 1 1 1 1 1
But we know that Washington, D.C. is not considered a state (the district often described as "Taxation Without Representation"). Let's find out which rows these are; we can either ignore them or treat D.C. as a separate state just for this analysis.
unique((mass_shootings[is.na(mass_shootings$state),c("location", "city", "state")])[ ,1]) # 1 Is for location, 2 is for city.
## [1] "" "Washington D.C."
mass_shootings %>% filter(location == "Washington D.C.")
## s title location city state date
## 1 164 Washington Navy Yard Washington D.C. Washington D.C. <NA> 2013-09-16
## incident_area open_close_location target cause
## 1 Close Random terrorism
## summary
## 1 On September 16, 2013, a 34-yearl old contractor for the Navy Yard in Washington D.C, entered the building shooting randomly, killing twelve people and injuring three surviving victims before he was shot and killed by law enforcement officers. Five other people were injured during the incident while trying to escape.
## fatalities injured total_victims policeman_killed age employeed_y_n
## 1 13 3 15 0 34 1
## employed_at mental_health_issues race gender
## 1 Navy Yard Yes Black American or African American Male
## latitude longitude year month month_name day Weekday dow Week
## 1 38.90481 -77.0163 2013 9 Sep 16 Monday 2 37
When we break the data down by state, which states have the most shootings?
# sort(table(mass_shootings$state),decreasing = TRUE)
# sum(is.na(mass_shootings$state)) # we have 46 Missing
p <-mass_shootings %>%
filter(!is.na(state)) %>%
group_by(state) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ungroup() %>%
mutate(state = reorder(state,count)) %>%
head(10) %>%
ggplot(aes(x = state,y = count)) +
geom_bar(stat='identity',color="white", fill = "lightslategray") +
geom_text(aes(x = state, y = 1, label = count),
hjust=0, vjust=.5, size = 4, color = 'black',
fontface = 'bold') +
labs(x = 'States',
y = 'No. of Shootings',
title = 'Number of Shootings/State') +
coord_flip() +
theme_fivethirtyeight()
ggplotly(p)
# Percentage by race
perc_state<- mass_shootings%>%
filter(!is.na(state)) %>%
group_by(state) %>%
summarise(count=n()) %>%
mutate(perc = round((count / sum(count)) * 100)) %>%
arrange(desc(count)) %>%
head(10)
formattable(perc_state,align=c("l","r","r"),list(count=color_bar("mediumslateblue"),perc=color_tile("white","lightsalmon")))
| state | count | perc |
|---|---|---|
| California | 29 | 10 |
| Florida | 20 | 7 |
| Texas | 16 | 6 |
| Washington | 14 | 5 |
| Georgia | 13 | 5 |
| Arizona | 11 | 4 |
| North Carolina | 11 | 4 |
| New York | 10 | 4 |
| Ohio | 10 | 4 |
| Alabama | 9 | 3 |
California, Florida, and Texas have the most shootings.
We are still left with 46 rows whose state is empty. We could fill these in from the latitude and longitude data that is available, for example through Google's geocoding service, though that requires an API key.
For all of these observations the location field is empty but the coordinates are given, so deriving an address from the coordinates is one way to fill in the missing State and City values:
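Since the Google geocoding route needs an API key, a rough offline alternative is `maps::map.where()`, which returns the name of the map polygon containing each coordinate. A sketch; region names come back lower-case, sometimes with a `:subregion` suffix, and `NA` for points outside the lower 48:

```r
library(maps)
library(dplyr)

# Rows with coordinates but no state
missing_state <- mass_shootings %>%
  filter(is.na(state), !is.na(latitude), !is.na(longitude))

# Look up which state polygon each coordinate falls in
map.where("state", x = missing_state$longitude, y = missing_state$latitude)
```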
# Convert age variable to numeric
mass_shootings$age <- as.numeric(mass_shootings$age)
## Warning: NAs introduced by coercion
# Impute missing values of Age with mean value of non-missing ages
mass_shootings$age <- ifelse(is.na(mass_shootings$age),
mean(mass_shootings$age, na.rm = TRUE),
mass_shootings$age)
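As an aside, the same imputation could use the median instead of the mean, which is less sensitive to the handful of very old or very young shooters; a sketch, to be run on the raw, un-imputed ages:

```r
library(dplyr)

# Median imputation instead of the mean (apply to the raw ages)
mass_shootings <- mass_shootings %>%
  mutate(age = if_else(is.na(age), median(age, na.rm = TRUE), age))
```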
# Filter mass shootings based on age groups
shootings_less_than_20 <- mass_shootings %>% filter(age < 20) # boundaries made non-overlapping
shootings_20_30 <- mass_shootings %>% filter(age >= 20, age < 30)
shootings_30_40 <- mass_shootings %>% filter(age >= 30, age < 40)
shootings_40_50 <- mass_shootings %>% filter(age >= 40, age < 50)
shootings_50_60 <- mass_shootings %>% filter(age >= 50, age < 60)
shootings_60_70 <- mass_shootings %>% filter(age >= 60, age < 70)
shootings_70_80 <- mass_shootings %>% filter(age >= 70, age < 80)
mass_shootings <- mass_shootings %>%
mutate(decade = (case_when(year >= 1960 & year < 1970 ~ "1960s",
year >= 1970 & year < 1980 ~ "1970s",
year >= 1980 & year < 1990 ~ "1980s",
year >= 1990 & year < 2000 ~ "1990s",
year >= 2000 & year < 2010 ~ "2000s",
year >= 2010 & year < 2020 ~ "2010s")))
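The `case_when()` ladder above can be collapsed into integer arithmetic, since the decade is just the year rounded down to a multiple of ten; a sketch:

```r
library(dplyr)

# Decade label from the year: 1966 -> "1960s", 2017 -> "2010s"
mass_shootings <- mass_shootings %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s"))
```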
decade_boxplot <- mass_shootings %>%
ggplot(aes(x =decade, y = age, fill = decade)) +
geom_boxplot() +
labs(x = "Each Decade", title = "Age Distribution of Mass Shooters by Decade")+
theme_fivethirtyeight()
ggplotly(decade_boxplot)
## Warning: plotly.js does not (yet) support horizontal legend items
## You can track progress here:
## https://github.com/plotly/plotly.js/issues/53
p <-mass_shootings %>%
filter(!is.na(race)) %>%
group_by(race) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(race = reorder(race,count)) %>%
ungroup() %>%
ggplot(aes(x = race,y = count)) +
geom_bar(stat='identity',color="white", fill = "#D8BFD8") +
geom_text(aes(x = race, y = 10, label = count),
hjust=0, vjust=.5, size = 4, color = 'black',
fontface = 'bold') +
coord_flip() +
labs(x = 'Race',
y = 'No. of Shootings',
title = 'Number of Shootings/Race')+
theme_fivethirtyeight()
ggplotly(p)
# Percentage by race
perc_race <- mass_shootings%>%
group_by(race) %>%
summarise(count=n()) %>%
mutate(perc=round((count/sum(count))*100)) %>%
arrange(desc(count))
formattable(perc_race,align=c("l","r","r"),list(count=color_bar("mediumslateblue"),perc=color_tile("white","lightsalmon")))
| race | count | perc |
|---|---|---|
| White American or European American | 144 | 45 |
| Black American or African American | 85 | 26 |
| Unknown | 44 | 14 |
| Other | 24 | 7 |
| Asian or Asian American | 18 | 6 |
| Latino | 5 | 2 |
| Native American or Alaska Native | 3 | 1 |
# Pie chart
shooter <- data.frame(table(mass_shootings$race))
colnames(shooter) <- c("race","freq")
ggplot(shooter,aes(x=race,
y=freq,
fill=race))+
theme(legend.position="none")+
geom_bar(stat="identity",width = 1)+
coord_polar(theta = "y")
The statistic shows the number of mass shootings in the United States between 1982 and November 19, 2018, by race and ethnicity of the shooter(s): 60 of the 107 mass shootings in that window were initiated by White shooters. The Las Vegas Strip massacre in 2017 had the highest number of victims in that period, with 58 people killed and over 500 injured.
The outer segment of the chart represents White American or European American shooters.
Is it fair to say that races other than White are also represented among mass shooters?
First, let's split the data into training and testing sets. (Note: `incident_data` is only created in the next chunk, which is why the calls below error when knitted in this order.)
# Load the caret package
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Set seed for reproducibility
set.seed(6496)
#Split the data
# Split the data into 70% training and 30% testing
train_index <- createDataPartition(incident_data$fatal_incident, p = 0.7, list = FALSE)
## Error in eval(expr, envir, enclos): object 'incident_data' not found
train_data <- incident_data[train_index, ]
## Error in eval(expr, envir, enclos): object 'incident_data' not found
test_data <- incident_data[-train_index, ]
## Error in eval(expr, envir, enclos): object 'incident_data' not found
Now, let's prepare `incident_data`, perform a 70/30 train-test split, and train the logistic regression model.
incident_data <- mass_shootings
# Convert target variable to binary (1 for incidents with fatalities, 0 otherwise)
incident_data$fatal_incident <- ifelse(incident_data$fatalities > 0, 1, 0)
# Select relevant features for the model
features <- c("age", "gender", "race", "mental_health_issues")
# Filter out rows with missing values in the selected features
incident_data <- na.omit(incident_data[, c(features, "fatal_incident")])
# Train-test split
set.seed(123) # for reproducibility
train_indices <- sample(nrow(incident_data), 0.7 * nrow(incident_data)) # 70% train, 30% test
train_data <- incident_data[train_indices, ]
test_data <- incident_data[-train_indices, ]
# Train the logistic regression model
logistic_model <- glm(fatal_incident ~ ., data = train_data, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summary of the model
summary(logistic_model)
##
## Call:
## glm(formula = fatal_incident ~ ., family = binomial(link = "logit"),
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.496e+00 8.330e+03 -0.001 0.99947
## age 1.436e-01 4.469e-02 3.214 0.00131
## genderMale 2.122e+01 2.509e+03 0.008 0.99325
## genderMale/Female 3.989e+01 1.463e+04 0.003 0.99782
## genderUnknown 1.966e+01 2.509e+03 0.008 0.99375
## raceBlack American or African American -1.971e+01 7.943e+03 -0.002 0.99802
## raceLatino -2.246e+00 1.582e+04 0.000 0.99989
## raceNative American or Alaska Native 1.929e+01 1.643e+04 0.001 0.99906
## raceOther -1.710e+01 7.943e+03 -0.002 0.99828
## raceUnknown -1.918e+01 7.943e+03 -0.002 0.99807
## raceWhite American or European American -1.748e+01 7.943e+03 -0.002 0.99824
## mental_health_issuesUnknown 5.359e-01 5.591e-01 0.958 0.33784
## mental_health_issuesYes 3.738e+01 3.482e+03 0.011 0.99143
##
## (Intercept)
## age **
## genderMale
## genderMale/Female
## genderUnknown
## raceBlack American or African American
## raceLatino
## raceNative American or Alaska Native
## raceOther
## raceUnknown
## raceWhite American or European American
## mental_health_issuesUnknown
## mental_health_issuesYes
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 173.20 on 225 degrees of freedom
## Residual deviance: 109.91 on 213 degrees of freedom
## AIC: 135.91
##
## Number of Fisher Scoring iterations: 20
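The warning `glm.fit: fitted probabilities numerically 0 or 1 occurred`, together with the enormous standard errors and p-values near 1 for the gender, race, and mental-health dummies, points to quasi-complete separation: some factor levels predict the outcome perfectly, so their coefficient estimates diverge. Only `age` is statistically meaningful here. A sketch of how one might inspect the separation and fit a more stable penalized model (this assumes the `logistf` package, which is not loaded elsewhere in this report):

```r
# Inspect separation: factor levels whose outcome is constant
# perfectly separate the classes
with(train_data, table(mental_health_issues, fatal_incident))

# Firth's bias-reduced logistic regression is more stable under separation
# (requires install.packages("logistf"))
library(logistf)
firth_model <- logistf(fatal_incident ~ age + gender + race + mental_health_issues,
                       data = train_data)
summary(firth_model)
```

The penalized fit shrinks the runaway coefficients toward plausible values while leaving the well-identified `age` effect largely unchanged.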
# Predict on test data
predicted_probs <- predict(logistic_model, newdata = test_data, type = "response")
# Convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
#Test the Model
# Compute accuracy
accuracy <- mean(predicted_classes == test_data$fatal_incident)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9072165
# Compute confusion matrix
conf_matrix <- table(Actual = test_data$fatal_incident, Predicted = predicted_classes)
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 3 9
## 1 0 85
# Compute precision, recall, and F1-score
precision <- conf_matrix[2, 2] / sum(predicted_classes)
recall <- conf_matrix[2, 2] / sum(test_data$fatal_incident)
f1_score <- 2 * precision * recall / (precision + recall)
cat("Precision:", precision, "\n")
## Precision: 0.9042553
cat("Recall:", recall, "\n")
## Recall: 1
cat("F1-score:", f1_score, "\n")
## F1-score: 0.9497207
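As a cross-check, the same metrics can be obtained in a single call from caret's `confusionMatrix()` (caret is already loaded above; `positive = "1"` marks fatal incidents as the positive class):

```r
# Cross-check accuracy, precision, recall, and F1 with caret
cm <- confusionMatrix(factor(predicted_classes, levels = c(0, 1)),
                      factor(test_data$fatal_incident, levels = c(0, 1)),
                      positive = "1")
cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall", "F1")]
```

This also reports sensitivity, specificity, and Cohen's kappa, which are useful given the heavy class imbalance in the confusion matrix above.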
# Plot ROC curve
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_curve <- roc(test_data$fatal_incident, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve", col = "blue")
auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.9132353
# Evaluate the model performance
confusion_matrix <- table(predicted_classes, test_data$fatal_incident)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Print the confusion matrix and accuracy
print(confusion_matrix)
##
## predicted_classes 0 1
## 0 3 0
## 1 9 85
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.907216494845361"
The fitted model follows the logistic form $P = \dfrac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$.
# Get the coefficients of the logistic regression model
coefficients <- coef(logistic_model)
print(coefficients)
## (Intercept) age
## -5.4960585 0.1436353
## genderMale genderMale/Female
## 21.2201803 39.8907200
## genderUnknown raceBlack American or African American
## 19.6624353 -19.7136216
## raceLatino raceNative American or Alaska Native
## -2.2455087 19.2921908
## raceOther raceUnknown
## -17.1013968 -19.1823710
## raceWhite American or European American mental_health_issuesUnknown
## -17.4790179 0.5358593
## mental_health_issuesYes
## 37.3805239
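Since logistic regression coefficients are log-odds, exponentiating them yields odds ratios. For example, the `age` coefficient of about 0.144 corresponds to an odds ratio of roughly exp(0.144) ≈ 1.15, i.e. each additional year of shooter age multiplies the odds of a fatal incident by about 1.15, holding the other features fixed. The separation-inflated coefficients (e.g. `genderMale` and `mental_health_issuesYes`) should not be interpreted this way. A minimal sketch:

```r
# Convert log-odds coefficients to odds ratios
odds_ratios <- exp(coef(logistic_model))
round(odds_ratios["age"], 3)  # odds multiplier per extra year of age, roughly 1.15
```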